The purpose of this project is to analyze and compare unsupervised learning algorithms by applying them to two different classification problems. The two families of unsupervised learning algorithms analyzed in this project are clustering algorithms and dimensionality reduction algorithms.
The analysis is carried out by applying the following steps to the two datasets -
Step 1: The K-means and Gaussian Mixture Model clustering algorithms will be executed on the selected datasets. Metrics such as SSE, Silhouette coefficients, and Adjusted Mutual Information will be compared to identify the optimal cluster count.
Step 2: PCA, FastICA, Randomized Projections, and a Random Forest Classifier will be applied to both datasets to conduct dimensionality reduction experiments.
Step 3: Both clustering algorithms will be executed on the datasets reduced using the above four dimensionality reduction techniques.
Step 4: Neural networks will be trained on one of the datasets reduced by the four dimensionality reduction techniques in Step 2, and the results will be analyzed.
Step 5: Neural networks will be trained on one of the datasets on which the two clustering algorithms were applied, and the results will be analyzed.
Problem: Given the physicochemical test results of a wine, predict whether its quality is above average.
Dataset Details: The UCI Wine Quality dataset consists of physicochemical and sensory data for red and white variants of the Portuguese "Vinho Verde" wine. The dataset is available at https://archive.ics.uci.edu/ml/datasets/Wine+Quality
Number of Instances: red wine - 1599; white wine - 4898.
Number of Attributes: 11 + output attribute
Why is this problem interesting?
Domain: Predicting the quality of a wine is crucial when determining its price. It is important to correctly identify above-average wines, so classifiers must keep the False Positive rate low; a slightly higher False Negative rate is acceptable. It will be interesting to see how the classifiers handle this constraint.
Dataset: The target variable is a multi-class categorical variable in which only a couple of classes are dominant. All the independent variables are numerical, there are no missing values in the dataset, and only a few features are slightly skewed. It will be interesting to see whether supervised learning algorithms show any bias/variance on such a clean dataset.
Observations from Supervised Learning algorithms: When supervised learning algorithms were executed on this dataset, they overfit quickly, possibly due to the large number of attributes relative to the number of instances. It will be interesting to observe how reducing dimensions impacts accuracy. Also, the original dataset has 9 classes, which were reduced to 2 for supervised learning; it will also be interesting to see what clusters the algorithms create from this dataset.
Problem: Predict whether the income of the individual exceeds 50k a year based on the individual's census data.
Dataset Details: Dataset was extracted by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)). Dataset is available at https://archive.ics.uci.edu/ml/datasets/Census+Income
Number of Instances: 48842
Number of Attributes: 14 + output attribute
Why is this problem interesting?
Domain: Predicting the income of an individual has many practical applications across the financial industry. For example, banks can decide whether to approve a loan based on the predicted income of an individual. In such scenarios, it is crucial to lower False Positive rate and it may be acceptable to have a slightly higher False Negative rate. It will be interesting to see how various classifiers implement these constraints.
Dataset: High-level data exploration revealed that many of the independent features are categorical with many classes; some features are qualitative while others are quantitative, and a few are highly skewed. It will be interesting to see how these features impact the performance of supervised algorithms. Missing values were also noticed in some of the instances; it will be interesting to see whether imputing or removing these instances creates any bias in the model.
Observations from Supervised Learning algorithms: During the implementation of supervised learning algorithms, it was observed that the computation time of many algorithms was high. Multi-collinearity was also observed in this dataset. It will be interesting to observe how reducing dimensions impacts the accuracy of the supervised learning algorithms.
#Packages Used
import itertools
import pandas as pd
import numpy as np
import utils
import time
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.cm as cm
from scipy import linalg
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, adjusted_mutual_info_score, adjusted_rand_score, homogeneity_completeness_v_measure
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.decomposition import PCA
from sklearn.decomposition import FastICA
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_val_predict, cross_val_score, GridSearchCV
%matplotlib inline
#load red wine dataset
dataset_loc_red = "./datasets/wine/winequality-red.csv"
df_red = pd.read_csv(dataset_loc_red, sep=";")
#load white wine dataset
dataset_loc_white = "./datasets/wine/winequality-white.csv"
df_white = pd.read_csv(dataset_loc_white, sep=";")
#combine both datasets
df_combined = pd.concat([df_red, df_white], ignore_index=True)
# check for missing values
df_combined.isnull().values.any()
# check for feature data types
df_combined.info()
#Analyze Features
df_combined.describe()
#Ploting the class distribution
sns.countplot(x='quality', data = df_combined);
#plot correlation heat map
plt.figure(figsize =(10,10))
sns.heatmap(df_combined.corr(),annot=True)
plt.show()
# Categorize wine into "above average" and "below average" wines
df_combined["quality"].values[df_combined["quality"] <= 5] = 0
df_combined["quality"].values[df_combined["quality"] > 5] = 1
Wine_X = df_combined.iloc[:,:-1]
Wine_y = df_combined.iloc[:,-1]
Wine_X_Scaled = StandardScaler().fit_transform(Wine_X)
# Split the data into a training set and a test set
Wine_X_train, Wine_X_test, Wine_y_train, Wine_y_test = train_test_split(Wine_X_Scaled, Wine_y, random_state=1, test_size=0.30)
#load census dataset
colnames=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
dataset_loc = "./datasets/census/adult.data"
df_census = pd.read_csv(dataset_loc, names=colnames, header=None, skipinitialspace=True)
# check for missing values
df_census.isnull().values.any()
# check for feature data types
df_census.info()
#Analyze Features
df_census.describe()
# Replace missing values "?" with NAs
df_census.replace('?', np.nan, inplace=True)
df_census.dropna(axis=0, inplace=True)
# Convert the target variable to binary (1 if income > 50K)
df_census['income'] = (df_census['income'] == '>50K').astype(int)
#Feature Encoding
le = preprocessing.LabelEncoder()
for i in range(0, df_census.shape[1]):
    if df_census.dtypes[i] == 'object':
        df_census[df_census.columns[i]] = le.fit_transform(df_census[df_census.columns[i]])
#Ploting the class distribution
sns.countplot(x='income', data = df_census);
Census_X = df_census.iloc[:,:-1]
Census_y = df_census.iloc[:,-1]
Census_X_Scaled = StandardScaler().fit_transform(Census_X)
# Split the data into a training set and a test set
Census_X_train, Census_X_test, Census_y_train, Census_y_test = train_test_split(Census_X_Scaled, Census_y, random_state=1, test_size=0.30)
K-Means is a popular unsupervised machine learning algorithm which clusters data points based on similarity. K refers to the number of desired clusters. A data point is assigned to a cluster if it is closer to that cluster's centroid than to the centroid of any other cluster. K-Means can converge to a local minimum. It finds centroids by alternating between assigning data points to clusters based on the current centroids and recomputing each centroid from the current assignment of data points.
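The assign-then-update iteration described above can be sketched in a few lines of NumPy (a toy illustration on synthetic blobs, not the scikit-learn implementation used in this project):

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs of 50 points each
X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (50, 2)),
               np.random.default_rng(2).normal(3, 0.3, (50, 2))])
centroids, labels = kmeans_sketch(X, k=2)
```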
Distance Measure: The choice of distance measure greatly influences the performance of the K-Means and Expectation Maximization algorithms. There is no free lunch, i.e. no single best distance measure for all algorithms; the choice depends on the dataset and problem type. Since both of our problems are binary classification and we have clean, compact datasets with minimal outliers, Euclidean distance is used. The Mahalanobis measure is computationally expensive, while the Cosine measure is better suited to document similarity, so neither is used. Manhattan distance generally works only when points are arranged in a grid, and it ignores geometric distance. K-Means minimizes the sum of squared distances, which corresponds to the Euclidean metric; with Manhattan distance, we cannot show that minimizing the sum of squared distances optimizes the Manhattan objective. In this project, both K-means and EM are analyzed for K values between 2 and 14.
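As a quick reference point for the two candidate metrics, SciPy's distance helpers give the L2 and L1 distances for a single pair of points:

```python
import numpy as np
from scipy.spatial.distance import cityblock, euclidean

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
# Euclidean (L2) distance: straight-line distance, sqrt(3^2 + 4^2) = 5.
# Manhattan (L1, "cityblock") distance: |3| + |4| = 7.
print(euclidean(a, b))  # 5.0
print(cityblock(a, b))  # 7.0
```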
Although clustering and dimensionality reduction are unsupervised learning methods that do not require labelled data or a test set, each dataset was split into train and test sets for later use in analyzing neural network performance. For the clustering and dimensionality reduction experiments in the next steps, only the training set of each dataset was used.
model_stats = []
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k)
    t_start = time.time()*1000.0
    model.fit(Wine_X_Scaled)
    predcls = model.predict(Wine_X_Scaled)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Wine_y, predcls)
    adjrandinfo = adjusted_rand_score(Wine_y, predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y, predcls)
    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])
dfstatsKM = pd.DataFrame(model_stats, columns=['model_id', 'k', 'predcls', 'sse', 'centroids', 'adjmutinfo', 'adjrandinfo', 'homogene', 'complete', 'vmeasure', 'exectime'])
model_stats = []
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k)
    t_start = time.time()*1000.0
    model.fit(Census_X_Scaled)
    predcls = model.predict(Census_X_Scaled)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Census_y, predcls)
    adjrandinfo = adjusted_rand_score(Census_y, predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y, predcls)
    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])
dfstatsCensusKM = pd.DataFrame(model_stats, columns=['model_id', 'k', 'predcls', 'sse', 'centroids', 'adjmutinfo', 'adjrandinfo', 'homogene', 'complete', 'vmeasure', 'exectime'])
# dfstatsCensusKM
The algorithm was executed multiple times with an increasing number of clusters, and within-cluster distance was plotted against cluster count. It was expected that within-cluster distance would fall sharply for the initial cluster counts and then flatten out, creating an elbow pattern. The optimal value of K is the point where the elbow bends and within-cluster distance stops falling sharply.
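A simple heuristic for locating the elbow programmatically is to find the last k whose extra cluster still buys a sizeable relative drop in SSE; a hypothetical sketch (the 10% cutoff is an arbitrary assumption, not a value used in this analysis):

```python
import numpy as np

def find_elbow(ks, sse, rel_threshold=0.10):
    """Return the first k after which the relative SSE improvement
    falls below rel_threshold (hypothetical 10% cutoff)."""
    sse = np.asarray(sse, dtype=float)
    # Fractional improvement gained by each additional cluster
    rel_drop = -np.diff(sse) / sse[:-1]
    for prev_k, drop in zip(ks[:-1], rel_drop):
        if drop < rel_threshold:
            return prev_k
    return ks[-1]

# Synthetic SSE curve with an elbow around k = 4
ks = list(range(2, 10))
sse = [100, 60, 35, 32, 30, 29, 28, 27]
print(find_elbow(ks, sse))  # 4
```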
plt.figure(figsize=(6, 6))
plt.plot(dfstatsKM['k'],dfstatsKM['sse'], '-o')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distance')
plt.minorticks_on()
plt.grid(visible=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(visible=True, which='minor', color='r', linestyle='-', alpha=0.05)
plt.title("Wine - Sum of Squared Distance")
plt.show()
plt.figure(figsize=(6, 6))
plt.plot(dfstatsCensusKM['k'],dfstatsCensusKM['sse'], '-o')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distance')
plt.minorticks_on()
plt.grid(visible=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(visible=True, which='minor', color='r', linestyle='-', alpha=0.05)
plt.title("Census - Sum of Squared Distance")
plt.show()
Observation: As the number of clusters increases, the sum of squared distances decreases, which leads to more generalized clusters at lower values of K. For both datasets the elbow is not prominent: within-cluster distance falls sharply at first but continues to fall as the number of clusters increases. Since neither graph gives a clear value of K, Silhouette analysis was carried out.
The Silhouette coefficient measures how similar an object is to its own cluster compared to other clusters. Compared to the elbow method, the Silhouette chart provides a clearer representation of the clusters and of how well each object is classified. A Silhouette coefficient of +1 indicates that the object is close to its own cluster and far away from neighboring clusters, whereas a value of -1 indicates that the object is far from its own cluster and close to a neighboring cluster. A value of 0 suggests the object lies on the boundary between two clusters. Higher Silhouette coefficients are preferred.
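The per-sample coefficient is s = (b - a) / max(a, b), where a is the mean distance to points in the same cluster and b is the mean distance to points in the nearest other cluster; a tiny worked example checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# For the point at 0.0: a = mean distance within its own cluster = 1.0,
# b = mean distance to the other cluster = (10 + 11) / 2 = 10.5,
# so s = (10.5 - 1.0) / 10.5 ≈ 0.905.
manual = (10.5 - 1.0) / 10.5
assert np.isclose(silhouette_samples(X, labels)[0], manual)
print(manual)
```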
# Below code to plot Silhouette chart was adapted from scikit learn website.
# https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
for n_clusters in range(2, 15):
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(Wine_X_Scaled) + (n_clusters + 1) * 10])
    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(Wine_X_Scaled)
    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(Wine_X_Scaled, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(Wine_X_Scaled, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples
    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(Wine_X_Scaled[:, 0], Wine_X_Scaled[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')
    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
Observations: Since the wine dataset is labeled with quality scores ranging from 1 to 9, I expected 7-9 clusters to give the highest Silhouette score. Instead, the highest score of 0.509 was achieved with two clusters, while scores for 7 or more clusters were very low. The scatter plot of the first two features indicates that objects become less separable as K increases. Even at K = 2 the points of the first two features are not clearly separable, which is also reflected in the low average Silhouette score.
# Below code to plot Silhouette chart was adapted from scikit learn website.
# https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
for n_clusters in range(2, 15):
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(Census_X_Scaled) + (n_clusters + 1) * 10])
    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(Census_X_Scaled)
    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(Census_X_Scaled, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(Census_X_Scaled, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples
    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(Census_X_Scaled[:, 0], Census_X_Scaled[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')
    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
Observations: Since the census dataset has binary labels, I expected 2 clusters to give the highest Silhouette score. Indeed, the highest score of 0.584 was achieved with two clusters, and the average score dropped as the number of clusters increased. The scatter plot shows the first and second features, both of which are categorical. Negative silhouette scores are observed for both datasets at high values of K, which indicates overlapping clusters. Some clusters at higher values of K also have thin silhouette profiles, which indicates they contain very few samples. It was also observed that Silhouette analysis is computationally expensive compared to the other validation methods.
K-means is a hard clustering method in which each data point is assigned to one and only one cluster. Expectation Maximization, on the other hand, is a soft clustering method which estimates parameters by maximizing the log-likelihood of the observed data. This allows EM to associate each data point with more than one cluster through posterior probabilities. Like K-means, EM may converge to a local optimum of the likelihood rather than the global one. Expectation Maximization was implemented using the GaussianMixture class in sklearn.
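The soft assignments can be inspected directly through GaussianMixture's predict_proba, which returns each component's posterior responsibility for each point (toy one-dimensional data, not the project datasets):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated 1-D Gaussians, 100 samples each
X = np.vstack([rng.normal(-5, 1, (100, 1)), rng.normal(5, 1, (100, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
resp = gmm.predict_proba(X)   # soft memberships; each row sums to 1
hard = gmm.predict(X)         # hard labels = argmax of the responsibilities

assert np.allclose(resp.sum(axis=1), 1.0)
assert (resp.argmax(axis=1) == hard).all()
```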
model_stats = []
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k)
    t_start = time.time()*1000.0
    model.fit(Wine_X_Scaled)
    predcls = model.predict(Wine_X_Scaled)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "gmm"
    centroids = model.means_
    loglike = model.score(Wine_X_Scaled)
    silscores = silhouette_samples(Wine_X_Scaled, predcls)
    adjmutinfo = adjusted_mutual_info_score(Wine_y, predcls)
    adjrandinfo = adjusted_rand_score(Wine_y, predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y, predcls)
    aic = model.aic(Wine_X_Scaled)
    bic = model.bic(Wine_X_Scaled)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])
dfstatsEM = pd.DataFrame(model_stats, columns=['model_id', 'k', 'predcls', 'loglike', 'silscore', 'centroids', 'adjmutinfo', 'adjrandinfo', 'homogene', 'complete', 'vmeasure', 'aic', 'bic', 'exectime'])
model_stats = []
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k, covariance_type='diag')
    t_start = time.time()*1000.0
    model.fit(Census_X_Scaled)
    predcls = model.predict(Census_X_Scaled)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "gmm"
    centroids = model.means_
    loglike = model.score(Census_X_Scaled)
    silscores = silhouette_samples(Census_X_Scaled, predcls)
    adjmutinfo = adjusted_mutual_info_score(Census_y, predcls)
    adjrandinfo = adjusted_rand_score(Census_y, predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y, predcls)
    aic = model.aic(Census_X_Scaled)
    bic = model.bic(Census_X_Scaled)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])
dfstatsCensusEM = pd.DataFrame(model_stats, columns=['model_id', 'k', 'predcls', 'loglike', 'silscore', 'centroids', 'adjmutinfo', 'adjrandinfo', 'homogene', 'complete', 'vmeasure', 'aic', 'bic', 'exectime'])
BIC (Bayesian Information Criterion) is used to carry out the analysis of the Expectation Maximization models and identify the optimal value of K. Compared to AIC, BIC imposes a heavier penalty on additional parameters. BIC was chosen to evaluate the EM models because we want to be more stringent in selecting components and avoid overfitting; AIC might be a better choice for exploratory dataset analysis.
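Both criteria take the form -2 log-likelihood plus a penalty: roughly 2p for AIC versus p ln(n) for BIC, so with more than about 8 samples BIC penalizes extra parameters more heavily. A small check on synthetic data drawn from a single Gaussian:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (500, 2))  # one true component

aics, bics = [], []
for k in range(1, 5):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    aics.append(gmm.aic(X))
    bics.append(gmm.bic(X))

# Since ln(500) ≈ 6.2 > 2, each extra component costs more under BIC,
# so the BIC gap between k=4 and k=1 must exceed the AIC gap.
assert (bics[3] - bics[0]) > (aics[3] - aics[0])
# BIC typically selects the single true component here
print("components preferred by BIC:", int(np.argmin(bics)) + 1)
```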
#Code for below plot was adapted from sklearn website
#https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html
lowest_bic = np.inf
bic = []
n_components_range = range(2, 15)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type=cv_type)
        gmm.fit(Wine_X_Scaled)
        bic.append(gmm.bic(Wine_X_Scaled))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm
bic = np.array(bic)
color_iter = itertools.cycle(['navy', 'turquoise', 'cornflowerblue',
'darkorange'])
clf = best_gmm
bars = []
# Plot the BIC scores
plt.figure(figsize=(8, 6))
spl = plt.subplot(2, 1, 1)
for i, (cv_type, color) in enumerate(zip(cv_types, color_iter)):
    xpos = np.array(n_components_range) + .2 * (i - 2)
    bars.append(plt.bar(xpos, bic[i * len(n_components_range):
                                  (i + 1) * len(n_components_range)],
                        width=.2, color=color))
plt.xticks(n_components_range)
plt.ylim([bic.min() * 1.01 - .01 * bic.max(), bic.max()])
plt.title('Wine - BIC score per model')
xpos = np.mod(bic.argmin(), len(n_components_range)) + .65 +\
.2 * np.floor(bic.argmin() / len(n_components_range))
plt.text(xpos, bic.min() * 0.97 + .03 * bic.max(), '*', fontsize=14)
spl.set_xlabel('Number of components')
spl.legend([b[0] for b in bars], cv_types)
Observation: The BIC score was calculated for 4 different covariance types. It was observed that the BIC score decreases as the number of components increases, up to a component count of 12. The lowest BIC score was observed for the 'full' covariance type, and the score starts increasing after 14 components.
#Code for below plot was adapted from sklearn website
#https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html
lowest_bic = np.inf
bic = []
n_components_range = range(2, 15)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type=cv_type, reg_covar=0.001)
        gmm.fit(Census_X_Scaled)
        bic.append(gmm.bic(Census_X_Scaled))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm
bic = np.array(bic)
color_iter = itertools.cycle(['navy', 'turquoise', 'cornflowerblue',
'darkorange'])
clf = best_gmm
bars = []
# Plot the BIC scores
plt.figure(figsize=(8, 6))
spl = plt.subplot(2, 1, 1)
for i, (cv_type, color) in enumerate(zip(cv_types, color_iter)):
    xpos = np.array(n_components_range) + .2 * (i - 2)
    bars.append(plt.bar(xpos, bic[i * len(n_components_range):
                                  (i + 1) * len(n_components_range)],
                        width=.2, color=color))
plt.xticks(n_components_range)
plt.ylim([bic.min() * 1.01 - .01 * bic.max(), bic.max()])
plt.title('Census - BIC score per model')
xpos = np.mod(bic.argmin(), len(n_components_range)) + .65 +\
.2 * np.floor(bic.argmin() / len(n_components_range))
plt.text(xpos, bic.min() * 0.97 + .03 * bic.max(), '*', fontsize=14)
spl.set_xlabel('Number of components')
spl.legend([b[0] for b in bars], cv_types)
Observation: For the census dataset, the same 4 covariance types were used to compare BIC scores. It was observed that the BIC score decreases as the number of components increases. The lowest BIC score was observed for the 'diag' covariance type with a component count of 12.
Since neither of the previous steps matched the expected optimal cluster count, further analysis was carried out by leveraging the available labels for each dataset. The following evaluation metrics were used as external measures of the clustering results: Adjusted Mutual Information, Adjusted Rand Index, and V-measure, along with execution time.
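These external measures compare the predicted clusters to the true labels and are invariant to how the cluster IDs are numbered, as a toy example shows:

```python
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

truth = [0, 0, 1, 1, 2, 2]
perfect_renamed = [2, 2, 0, 0, 1, 1]   # same partition, different cluster IDs
random_ish = [0, 1, 0, 1, 0, 1]        # ignores the true grouping entirely

# The renamed partition still scores perfectly on both measures,
# while the unrelated partition scores near zero.
assert adjusted_rand_score(truth, perfect_renamed) > 0.99
assert adjusted_mutual_info_score(truth, perfect_renamed) > 0.99
assert adjusted_rand_score(truth, random_ish) < 0.5
print("external metrics reward the partition, not the label names")
```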
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(10,10));
axs[0][0].plot(kclusters, dfstatsKM["adjmutinfo"], color='blue', linestyle='-', marker='o', label = "KM Adj Mutual Info")
axs[0][0].plot(kclusters, dfstatsEM["adjmutinfo"], color='red', linestyle='-', marker='o', label = "EM Adj Mutual Info")
axs[0][0].set_title('Wine - Adj Mutual Info')
axs[0][1].plot(kclusters, dfstatsKM["adjrandinfo"], color='green', linestyle='-', marker='o', label = "KM Adj Random Info")
axs[0][1].plot(kclusters, dfstatsEM["adjrandinfo"], color='black', linestyle='-', marker='o', label = "EM Adj Random Info")
axs[0][1].set_title('Wine - Adj Random Info')
axs[1][0].plot(kclusters, dfstatsKM["vmeasure"], color='black', linestyle='-', marker='o', label = "KM V Measure")
axs[1][0].plot(kclusters, dfstatsEM["vmeasure"], color='red', linestyle='-', marker='o', label = "EM V Measure")
axs[1][0].set_title('Wine - V Measure')
axs[1][1].plot(kclusters, dfstatsKM["exectime"], color='green', linestyle='-', marker='o', label = "KM Execution Time")
axs[1][1].plot(kclusters, dfstatsEM["exectime"], color='blue', linestyle='-', marker='o', label = "EM Execution Time")
axs[1][1].set_title('Wine - Execution Time')
for ax in axs.flat:
    ax.set(xlabel='# of Clusters', ylabel='Score')
    ax.legend(loc='best')
    ax.minorticks_on()
    ax.grid(visible=True, which='major', color='k', linestyle='-', alpha=0.1)
    ax.grid(visible=True, which='minor', color='r', linestyle='-', alpha=0.05)
fig.tight_layout()
Observation: For the wine dataset, GMM clearly outperforms K-means. Across all the collected metrics, EM scored highest at a cluster count of 4. K-means assumes spherical clusters, whereas GMM does not assume any particular geometric shape and hence works well with non-spherical distributions, which explains why it outperformed K-means here. It was also observed that the computational efficiency of K-means and EM is very close, which can possibly be attributed to the small size of the dataset.
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(10,10));
axs[0][0].plot(kclusters, dfstatsCensusKM["adjmutinfo"], color='blue', linestyle='-', marker='o', label = "KM Adj Mutual Info")
axs[0][0].plot(kclusters, dfstatsCensusEM["adjmutinfo"], color='red', linestyle='-', marker='o', label = "EM Adj Mutual Info")
axs[0][0].set_title('Census - Adj Mutual Info')
axs[0][1].plot(kclusters, dfstatsCensusKM["adjrandinfo"], color='green', linestyle='-', marker='o', label = "KM Adj Random Info")
axs[0][1].plot(kclusters, dfstatsCensusEM["adjrandinfo"], color='black', linestyle='-', marker='o', label = "EM Adj Random Info")
axs[0][1].set_title('Census - Adj Random Info')
axs[1][0].plot(kclusters, dfstatsCensusKM["vmeasure"], color='black', linestyle='-', marker='o', label = "KM V Measure")
axs[1][0].plot(kclusters, dfstatsCensusEM["vmeasure"], color='red', linestyle='-', marker='o', label = "EM V Measure")
axs[1][0].set_title('Census - V Measure')
axs[1][1].plot(kclusters, dfstatsCensusKM["exectime"], color='green', linestyle='-', marker='o', label = "KM Execution Time")
axs[1][1].plot(kclusters, dfstatsCensusEM["exectime"], color='blue', linestyle='-', marker='o', label = "EM Execution Time")
axs[1][1].set_title('Census - Execution Time')
for ax in axs.flat:
    ax.set(xlabel='# of Clusters', ylabel='Score')
    ax.legend(loc='best')
    ax.minorticks_on()
    ax.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
    ax.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
fig.tight_layout()
Observation: On the census dataset GMM also clearly performs better than K-means. On all the collected metrics, GMM scored highest at a cluster count of 3. This suggests the data has latent structure that is not easily captured by spherical clusters, which is why K-means performed poorly compared to GMM.
Dimensionality reduction is the process of reducing the number of random variables by identifying a set of principal variables. Four dimensionality reduction techniques are implemented to analyze their performance on the wine and census datasets. The objective is to identify the optimal number of dimensions such that minimal information is lost while the efficiency of the algorithm is improved.
PCA simplifies the complexity of high-dimensional data by identifying the correlation between variables; it can reduce dimensionality effectively when strong correlations exist between variables. The PCA method available in the sklearn.decomposition module is used to analyze the principal components. As a first step the data was standardized. Then eigenvectors and eigenvalues were obtained from the covariance matrix, and the eigenvectors corresponding to the largest eigenvalues were selected by sorting. The projection matrix was then constructed from the selected eigenvectors and used to reduce the original dataset.
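The eigendecomposition steps described above can be sketched on synthetic data (a minimal illustration; the array and variable names here are hypothetical, not from the project code — the project itself uses sklearn's PCA, which performs these steps internally):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)  # inject a strong correlation

# Step 1: standardize the data
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: eigenvectors and eigenvalues of the covariance matrix
cov = np.cov(Xs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # Step 3: sort descending by eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: projection matrix from the top-2 eigenvectors reduces 5 features to 2
W = eigvecs[:, :2]
X_reduced = Xs @ W
print(X_reduced.shape)  # (200, 2)
```

The correlated pair of features collapses onto a single strong component, which is exactly the redundancy PCA exploits.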
pca = PCA()
Wine_X_pca = pca.fit_transform(Wine_X_Scaled)
expvarratio = pca.explained_variance_ratio_
ncomponent = expvarratio.size
cumvariance = np.cumsum(expvarratio)
eigenvalues = pca.singular_values_
# Reconstruct from the transformed data, not the original scaled data
Wine_X_proj = pca.inverse_transform(Wine_X_pca)
loss = ((Wine_X_Scaled - Wine_X_proj) ** 2).mean()
fig = plt.figure(figsize=(10, 6))
ax1 = fig.add_subplot(111)
ax1.plot(list(range(cumvariance.size)), cumvariance, color='blue', linestyle='-', marker='o', label = "Cumulative Explained Variance")
ax1.set_ylabel('Cumulative Explained Variance Ratio')
plt.xlabel('Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.minorticks_on()
plt.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
ax2 = ax1.twinx()
ax2.plot(list(range(eigenvalues.size)), eigenvalues, color='r', linestyle='-', label = "Eigenvalues")
ax2.set_ylabel('Eigenvalues', color='r')
for tl in ax2.get_yticklabels():
    tl.set_color('r')
pca = PCA()
Census_X_pca = pca.fit_transform(Census_X_Scaled)
expvarratio = pca.explained_variance_ratio_
ncomponent = expvarratio.size
cumvariance = np.cumsum(expvarratio)
eigenvalues = pca.singular_values_
# Reconstruct from the transformed data, not the original scaled data
Census_X_proj = pca.inverse_transform(Census_X_pca)
loss = ((Census_X_Scaled - Census_X_proj) ** 2).mean()
fig = plt.figure(figsize=(10, 6))
ax1 = fig.add_subplot(111)
ax1.plot(list(range(cumvariance.size)), cumvariance, color='blue', linestyle='-', marker='o', label = "Cumulative Explained Variance")
ax1.set_ylabel('Cumulative Explained Variance Ratio')
plt.xlabel('Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.minorticks_on()
plt.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
ax2 = ax1.twinx()
ax2.plot(list(range(eigenvalues.size)), eigenvalues, color='r', linestyle='-', label = "Eigenvalues")
ax2.set_ylabel('Eigenvalues', color='r')
for tl in ax2.get_yticklabels():
    tl.set_color('r')
Observation: A scree plot is used to check whether PCA works well on the wine dataset. Since principal components capture the variation in the data, we can keep the components that capture most of the information and discard the rest without losing much. The scree plot for the wine dataset is not ideal (the elbow is not prominent), so the number of components is instead chosen from the cumulative-variance plot so that at least 80% of the variance is explained. For the wine dataset, 4 components explain at least 80% of the variance; the scree plot also shows the eigenvalues decreasing sharply over the first four components and flattening out through component 9. For the census dataset, the first 9 components explain at least 80% of the variance.
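The 80%-variance cutoff described above can be computed directly from the cumulative variance array. The ratios below are illustrative stand-ins for `pca.explained_variance_ratio_`, not values from the wine data:

```python
import numpy as np

# Hypothetical explained-variance ratios for an 11-feature dataset
expvarratio = np.array([0.35, 0.20, 0.15, 0.12, 0.06, 0.04,
                        0.03, 0.02, 0.015, 0.01, 0.005])
cumvariance = np.cumsum(expvarratio)

# Smallest number of components whose cumulative variance reaches 80%
n_keep = int(np.argmax(cumvariance >= 0.80)) + 1
print(n_keep)  # 4 for these illustrative ratios
```

`np.argmax` returns the first index where the condition holds, so adding 1 converts the 0-based index into a component count.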
pcaW = PCA(n_components = 4).fit(Wine_X_train)
Wine_X_train_PCA = pcaW.transform(Wine_X_train)
Wine_X_test_PCA = pcaW.transform(Wine_X_test)
pcaC = PCA(n_components = 9).fit(Census_X_train)
Census_X_train_PCA = pcaC.transform(Census_X_train)
Census_X_test_PCA = pcaC.transform(Census_X_test)
The ICA algorithm reveals underlying hidden factors by decomposing linear mixtures of signals into their independent components. ICA first whitens the data by removing correlations, then produces independent components from the multivariate signal through an orthogonal rotation that maximizes the non-Gaussianity (measured by kurtosis) of the components. The FastICA method available in the sklearn.decomposition module is used to analyze kurtosis.
ICA can be viewed as an extension of PCA, but it can perform better than PCA because it looks at each component independently. The performance of ICA suffers, however, when all features contribute to building the model.
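The kurtosis criterion above can be illustrated on a synthetic mixture (a sketch with hypothetical sources, not project data): two non-Gaussian sources are linearly mixed, FastICA unmixes them, and the mean absolute kurtosis of the recovered components — the same statistic computed in the loops below — is clearly non-zero:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# Two non-Gaussian sources: Laplacian (peaked) and uniform (flat)
S = np.column_stack([rng.laplace(size=2000), rng.uniform(-1, 1, size=2000)])
A = np.array([[1.0, 0.5], [0.5, 1.0]])  # mixing matrix
X = S @ A.T                             # observed linear mixture

ica = FastICA(n_components=2, random_state=0)
components = ica.fit_transform(X)

# Mean absolute (excess) kurtosis, as used for component selection below
kurt = pd.DataFrame(components).kurt(axis=0).abs().mean()
print(round(kurt, 2))
```

A Gaussian source would have excess kurtosis near zero, which is why strongly non-Gaussian components indicate successful separation.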
compnts = list(np.arange(1,(Wine_X_Scaled.shape[1]-1),1))
kurtosis = []
for c in compnts:
    ica = FastICA(n_components=c)
    Wineica = ica.fit_transform(Wine_X_Scaled)
    tmpdf = pd.DataFrame(Wineica)
    tmpdf = tmpdf.kurt(axis=0)
    kurtosis.append(tmpdf.abs().mean())
plt.figure(figsize=(6, 6))
plt.plot(compnts, kurtosis, color='blue', linestyle='-', marker='o')
plt.xlabel('Components')
plt.ylabel('Kurtosis')
plt.minorticks_on()
plt.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
plt.title("Wine - Kurtosis vs Components")
plt.show()
compnts = list(np.arange(1,(Census_X_Scaled.shape[1]),1))
kurtosis = []
for c in compnts:
    ica = FastICA(n_components=c)
    Censusica = ica.fit_transform(Census_X_Scaled)
    tmpdf = pd.DataFrame(Censusica)
    tmpdf = tmpdf.kurt(axis=0)
    kurtosis.append(tmpdf.abs().mean())
plt.figure(figsize=(6, 6))
plt.plot(compnts, kurtosis, color='blue', linestyle='-', marker='o')
plt.xlabel('Components')
plt.ylabel('Kurtosis')
plt.minorticks_on()
plt.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
plt.title("Census - Kurtosis vs Components")
plt.show()
Observation: For the wine dataset the number of components with the highest kurtosis is 8, whereas for the census dataset it is 7.
icaW = FastICA(n_components = 8).fit(Wine_X_train)
Wine_X_train_ICA = icaW.transform(Wine_X_train)
Wine_X_test_ICA = icaW.transform(Wine_X_test)
icaC = FastICA(n_components = 7).fit(Census_X_train)
Census_X_train_ICA = icaC.transform(Census_X_train)
Census_X_test_ICA = icaC.transform(Census_X_test)
Random Projection is another dimensionality reduction algorithm, which projects the input space onto a randomly chosen lower-dimensional subspace. The method is powerful and simple, and has a low error rate compared to other methods. The GaussianRandomProjection method available in sklearn's random_projection module is used to analyze the reconstruction error for both datasets; it reduces dimensionality by projecting the data through a random Gaussian matrix.
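The theoretical guarantee behind random projection is the Johnson-Lindenstrauss lemma, and sklearn exposes its bound directly. A small sketch on synthetic data (the sample counts and dimensions are illustrative) shows both the bound and the reconstruction-error computation used in the loops below:

```python
import numpy as np
from sklearn.random_projection import (GaussianRandomProjection,
                                       johnson_lindenstrauss_min_dim)

# JL bound: target dimension needed to preserve pairwise distances within eps.
# For small datasets the bound often exceeds the original dimensionality,
# so in practice it serves only as a rough guide.
print(johnson_lindenstrauss_min_dim(n_samples=4898, eps=0.1))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 11))

rp = GaussianRandomProjection(n_components=4, random_state=0)
X_rp = rp.fit_transform(X)

# Map back through the projection matrix and measure reconstruction error
recon = X_rp @ rp.components_
err = np.mean((X - recon) ** 2)
print(X_rp.shape, round(err, 3))
```

Because the projection is random, the reconstruction error varies across runs unless a `random_state` is fixed, which is worth keeping in mind when reading the error plots below.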
compnts = list(np.arange(1,(Wine_X_Scaled.shape[1]),1))
err = []
for c in compnts:
    rp = GaussianRandomProjection(n_components=c)
    Winerp = rp.fit_transform(Wine_X_Scaled)
    recon = np.dot(Winerp, rp.components_)
    reconerr = np.mean((Wine_X_Scaled - recon)**2)
    err.append(reconerr)
plt.figure(figsize=(6, 6))
plt.plot(compnts, err, color='blue', linestyle='-', marker='o')
plt.xlabel('Components')
plt.ylabel('Reconstruction Error')
plt.minorticks_on()
plt.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
plt.title("Wine - Reconstruction Error vs Components")
plt.show()
compnts = list(np.arange(1,(Census_X_Scaled.shape[1]),1))
err = []
for c in compnts:
    rp = GaussianRandomProjection(n_components=c)
    Censusrp = rp.fit_transform(Census_X_Scaled)
    recon = np.dot(Censusrp, rp.components_)
    reconerr = np.mean((Census_X_Scaled - recon)**2)
    err.append(reconerr)
plt.figure(figsize=(6, 6))
plt.plot(compnts, err, color='blue', linestyle='-', marker='o')
plt.xlabel('Components')
plt.ylabel('Reconstruction Error')
plt.minorticks_on()
plt.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
plt.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
plt.title("Census - Reconstruction Error vs Components")
plt.show()
Observation: Analysis of Reconstruction Error for wine and census dataset reveals that the optimal number of components for wine dataset is 4 whereas optimal component for census dataset is 5.
rpW = GaussianRandomProjection(n_components = 4).fit(Wine_X_train)
Wine_X_train_RP = rpW.transform(Wine_X_train)
Wine_X_test_RP = rpW.transform(Wine_X_test)
rpC = GaussianRandomProjection(n_components = 5).fit(Census_X_train)
Census_X_train_RP = rpC.transform(Census_X_train)
Census_X_test_RP = rpC.transform(Census_X_test)
Besides classification, a Random Forest Classifier can also be used to reduce dimensionality. It can analyze each feature and assess its importance by measuring how much that feature contributes to the model's information. The algorithm does not assume that the data is linearly separable. A Random Forest counts how many times each feature is selected for a split and generates a score from this, which helps identify the most predictive features.
Sklearn's RandomForestClassifier is used to assess the importance of features for both datasets.
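As a quick sanity check of the importance mechanism, a sketch on synthetic data (hypothetical features, not the wine or census data) where only one feature determines the label shows that feature receiving the dominant importance score, and that the scores are normalized to sum to 1:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 2] > 0).astype(int)   # only feature 2 is informative

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

print(np.argmax(importances))       # the informative feature dominates
print(round(importances.sum(), 6))  # importances are normalized to sum to 1
```

This normalization is why a fixed threshold such as 0.05 (used with SelectFromModel below) is a meaningful cutoff regardless of how many features the dataset has.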
#Below code is adapted from -
#https://github.com/Einsteinish/bogotobogo-Machine-Learning/blob/master/scikit_machine_learning_Data_Processing-III-Sequential-Feature-Selection.ipynb
feat_labels = df_combined.columns[1:]
forest = RandomForestClassifier(max_depth=10, ccp_alpha=0.007, criterion='entropy')
forest.fit(Wine_X_Scaled, Wine_y)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(Wine_X_Scaled.shape[1]):
    # report features in descending order of importance
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
plt.title('Wine - Feature Importances')
plt.bar(range(Wine_X_Scaled.shape[1]), importances[indices],
        color='green', align='center')
plt.xticks(range(Wine_X_Scaled.shape[1]), feat_labels[indices], rotation=90)
plt.xlim([-1, Wine_X_Scaled.shape[1]])
plt.tight_layout()
plt.show()
sfm = SelectFromModel(forest, threshold=0.05)
sfm.fit(Wine_X_train, Wine_y_train)
for feature_list_index in sfm.get_support(indices=True):
    print(feat_labels[feature_list_index])
Wine_X_train_RF = sfm.transform(Wine_X_train)
Wine_X_test_RF = sfm.transform(Wine_X_test)
#Below code is adapted from -
#https://github.com/Einsteinish/bogotobogo-Machine-Learning/blob/master/scikit_machine_learning_Data_Processing-III-Sequential-Feature-Selection.ipynb
feat_labels = df_census.columns[1:]
forest = RandomForestClassifier(max_depth=10, ccp_alpha=0.002, criterion='entropy')
forest.fit(Census_X_Scaled, Census_y)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(Census_X_Scaled.shape[1]):
    # report features in descending order of importance
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
plt.title('Census - Feature Importances')
plt.bar(range(Census_X_Scaled.shape[1]), importances[indices],
        color='green', align='center')
plt.xticks(range(Census_X_Scaled.shape[1]), feat_labels[indices], rotation=90)
plt.xlim([-1, Census_X_Scaled.shape[1]])
plt.tight_layout()
plt.show()
sfm = SelectFromModel(forest, threshold=0.05)
sfm.fit(Census_X_train, Census_y_train)
for feature_list_index in sfm.get_support(indices=True):
    print(feat_labels[feature_list_index])
Census_X_train_RF = sfm.transform(Census_X_train)
Census_X_test_RF = sfm.transform(Census_X_test)
Observation: For the wine dataset the features with the highest importance are citric acid, free sulfur dioxide, and pH. For the census dataset the following five features with the highest importance will be used: workclass, marital-status, occupation, race, and capital-loss.
In this step the two clustering algorithms implemented in section 5 are re-run, this time on the datasets reduced by the four dimensionality reduction techniques implemented in section 6.
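The per-dataset loops below all follow the same pattern, which could be condensed into a helper. A sketch of such a helper is shown here on synthetic stand-in data (`evaluate_kmeans` is a hypothetical function, not part of the project code):

```python
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             homogeneity_completeness_v_measure)

def evaluate_kmeans(modelid, X, y, kclusters=range(2, 15)):
    """Fit K-means for each k and collect the metrics used in this section."""
    rows = []
    for k in kclusters:
        model = KMeans(n_clusters=k, n_init=10).fit(X)
        t0 = time.time() * 1000.0
        pred = model.predict(X)        # timing covers prediction only
        exectime = time.time() * 1000.0 - t0
        h, c, v = homogeneity_completeness_v_measure(y, pred)
        rows.append([modelid, k, model.inertia_,
                     adjusted_mutual_info_score(y, pred),
                     adjusted_rand_score(y, pred), h, c, v, exectime])
    return rows

# Usage on synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 2, size=300)
stats = evaluate_kmeans("Demo_PCA_Kmeans", X, y, kclusters=range(2, 5))
print(len(stats))  # one row per k
```

The same shape would work for the GMM loops by swapping the estimator and metrics.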
model_stats = []
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Wine_X_train_PCA)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_PCA)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_PCA_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)
    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Census_X_train_PCA)
    t_start = time.time()*1000.0
    predcls = model.predict(Census_X_train_PCA)  # timing covers prediction only
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_PCA_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)
    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Wine_X_train_ICA)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_ICA)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_ICA_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)
    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Census_X_train_ICA)
    t_start = time.time()*1000.0
    predcls = model.predict(Census_X_train_ICA)  # timing covers prediction only
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_ICA_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)
    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Wine_X_train_RP)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_RP)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_RP_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)
    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Census_X_train_RP)
    t_start = time.time()*1000.0
    predcls = model.predict(Census_X_train_RP)  # timing covers prediction only
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_RP_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)
    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Wine_X_train_RF)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_RF)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_RF_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)
    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k).fit(Census_X_train_RF)
    t_start = time.time()*1000.0
    predcls = model.predict(Census_X_train_RF)  # timing covers prediction only
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_RF_Kmeans"
    centroids = model.cluster_centers_
    sse = model.inertia_
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)
    model_stats.append([modelid, k, predcls, sse, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, exectime])
dfstatsKMDimRed = pd.DataFrame(model_stats, columns=['model_id', 'k', 'predcls', 'sse', 'centroids', 'adjmutinfo', 'adjrandinfo', 'homogene', 'complete', 'vmeasure', 'exectime'])
dfstatsKMDimRed
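With all runs collected into one DataFrame, the best k per model can be read off with a groupby. A minimal sketch with illustrative values (the numbers below are made up, not project results):

```python
import pandas as pd

# Illustrative stats table with a subset of dfstatsKMDimRed's columns
df = pd.DataFrame({
    'model_id': ['Wine_PCA_Kmeans'] * 3 + ['Census_PCA_Kmeans'] * 3,
    'k': [2, 3, 4, 2, 3, 4],
    'adjmutinfo': [0.10, 0.25, 0.20, 0.05, 0.08, 0.12],
})

# Row with the best adjusted mutual information per model
best = df.loc[df.groupby('model_id')['adjmutinfo'].idxmax()]
print(best[['model_id', 'k']].to_string(index=False))
```

The same pattern works for any of the other metric columns, e.g. `vmeasure` or `adjrandinfo`.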
model_stats = []
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k).fit(Wine_X_train_PCA)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_PCA)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_PCA_EM"
    centroids = model.means_
    loglike = model.score(Wine_X_train_PCA)
    silscores = silhouette_samples(Wine_X_train_PCA, predcls)
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)
    aic = model.aic(Wine_X_train_PCA)
    bic = model.bic(Wine_X_train_PCA)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k, covariance_type='diag').fit(Census_X_train_PCA)
    t_start = time.time()*1000.0
    predcls = model.predict(Census_X_train_PCA)  # timing covers prediction only
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_PCA_EM"
    centroids = model.means_
    loglike = model.score(Census_X_train_PCA)
    silscores = silhouette_samples(Census_X_train_PCA, predcls)
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)
    aic = model.aic(Census_X_train_PCA)
    bic = model.bic(Census_X_train_PCA)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k).fit(Wine_X_train_ICA)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_ICA)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_ICA_EM"
    centroids = model.means_
    loglike = model.score(Wine_X_train_ICA)
    silscores = silhouette_samples(Wine_X_train_ICA, predcls)
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)
    aic = model.aic(Wine_X_train_ICA)
    bic = model.bic(Wine_X_train_ICA)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k, covariance_type='diag').fit(Census_X_train_ICA)
    t_start = time.time()*1000.0
    predcls = model.predict(Census_X_train_ICA)  # timing covers prediction only
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_ICA_EM"
    centroids = model.means_
    loglike = model.score(Census_X_train_ICA)
    silscores = silhouette_samples(Census_X_train_ICA, predcls)
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)
    aic = model.aic(Census_X_train_ICA)
    bic = model.bic(Census_X_train_ICA)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k).fit(Wine_X_train_RP)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_RP)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_RP_EM"
    centroids = model.means_
    loglike = model.score(Wine_X_train_RP)
    silscores = silhouette_samples(Wine_X_train_RP, predcls)
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)
    aic = model.aic(Wine_X_train_RP)
    bic = model.bic(Wine_X_train_RP)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k, covariance_type='diag').fit(Census_X_train_RP)
    t_start = time.time()*1000.0
    predcls = model.predict(Census_X_train_RP)  # timing covers prediction only
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_RP_EM"
    centroids = model.means_
    loglike = model.score(Census_X_train_RP)
    silscores = silhouette_samples(Census_X_train_RP, predcls)
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)
    aic = model.aic(Census_X_train_RP)
    bic = model.bic(Census_X_train_RP)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k).fit(Wine_X_train_RF)
    t_start = time.time()*1000.0
    predcls = model.predict(Wine_X_train_RF)
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Wine_RF_EM"
    centroids = model.means_
    loglike = model.score(Wine_X_train_RF)
    silscores = silhouette_samples(Wine_X_train_RF, predcls)
    adjmutinfo = adjusted_mutual_info_score(Wine_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Wine_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Wine_y_train,predcls)
    aic = model.aic(Wine_X_train_RF)
    bic = model.bic(Wine_X_train_RF)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k, covariance_type='diag').fit(Census_X_train_RF)
    t_start = time.time()*1000.0
    predcls = model.predict(Census_X_train_RF)  # timing covers prediction only
    t_stop = time.time()*1000.0
    exectime = t_stop - t_start
    modelid = "Census_RF_EM"
    centroids = model.means_
    loglike = model.score(Census_X_train_RF)
    silscores = silhouette_samples(Census_X_train_RF, predcls)
    adjmutinfo = adjusted_mutual_info_score(Census_y_train,predcls)
    adjrandinfo = adjusted_rand_score(Census_y_train,predcls)
    homogene, complete, vmeasure = homogeneity_completeness_v_measure(Census_y_train,predcls)
    aic = model.aic(Census_X_train_RF)
    bic = model.bic(Census_X_train_RF)
    model_stats.append([modelid, k, predcls, loglike, silscores, centroids, adjmutinfo, adjrandinfo, homogene, complete, vmeasure, aic, bic, exectime])
dfstatsEMDimRed = pd.DataFrame(model_stats, columns=['model_id', 'k', 'predcls', 'loglike', 'silscore', 'centroids', 'adjmutinfo', 'adjrandinfo', 'homogene', 'complete', 'vmeasure', 'aic', 'bic', 'exectime'])
dfstatsEMDimRed
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15,15));
axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_PCA_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Wine PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_PCA_EM']["adjmutinfo"], linestyle='-', marker='o', label = "Wine PCA EM")
axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_PCA_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Census PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_PCA_EM']["adjmutinfo"], linestyle='-', marker='o', label = "Census PCA EM")
axs[0][0].set_title('Reduced Dimensionality: PCA - Adj Mutual Info')
axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_ICA_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Wine ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_ICA_EM']["adjmutinfo"], linestyle='-', marker='o', label = "Wine ICA EM")
axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_ICA_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Census ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_ICA_EM']["adjmutinfo"], linestyle='-', marker='o', label = "Census ICA EM")
axs[0][1].set_title('Reduced Dimensionality: ICA - Adj Mutual Info')
axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RP_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Wine RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RP_EM']["adjmutinfo"], linestyle='-', marker='o', label = "Wine RP EM")
axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RP_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Census RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RP_EM']["adjmutinfo"], linestyle='-', marker='o', label = "Census RP EM")
axs[1][0].set_title('Reduced Dimensionality: RP - Adj Mutual Info')
axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RF_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Wine RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RF_EM']["adjmutinfo"], linestyle='-', marker='o', label = "Wine RF EM")
axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RF_Kmeans']["adjmutinfo"], linestyle='-', marker='o', label = "Census RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RF_EM']["adjmutinfo"], linestyle='-', marker='o', label = "Census RF EM")
axs[1][1].set_title('Reduced Dimensionality: RF - Adj Mutual Info')
for ax in axs.flat:
    ax.set(xlabel='# of Clusters', ylabel='Score')
    ax.legend(loc='best')
    ax.minorticks_on()
    ax.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
    ax.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
fig.tight_layout()
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15,15));
axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_PCA_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Wine PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_PCA_EM']["adjrandinfo"], linestyle='-', marker='o', label = "Wine PCA EM")
axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_PCA_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Census PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_PCA_EM']["adjrandinfo"], linestyle='-', marker='o', label = "Census PCA EM")
axs[0][0].set_title('Reduced Dimensionality: PCA - Adj Random Info')
axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_ICA_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Wine ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_ICA_EM']["adjrandinfo"], linestyle='-', marker='o', label = "Wine ICA EM")
axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_ICA_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Census ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_ICA_EM']["adjrandinfo"], linestyle='-', marker='o', label = "Census ICA EM")
axs[0][1].set_title('Reduced Dimensionality: ICA - Adj Random Info')
axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RP_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Wine RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RP_EM']["adjrandinfo"], linestyle='-', marker='o', label = "Wine RP EM")
axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RP_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Census RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RP_EM']["adjrandinfo"], linestyle='-', marker='o', label = "Census RP EM")
axs[1][0].set_title('Reduced Dimensionality: RP - Adj Random Info')
axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RF_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Wine RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RF_EM']["adjrandinfo"], linestyle='-', marker='o', label = "Wine RF EM")
axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RF_Kmeans']["adjrandinfo"], linestyle='-', marker='o', label = "Census RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RF_EM']["adjrandinfo"], linestyle='-', marker='o', label = "Census RF EM")
axs[1][1].set_title('Reduced Dimensionality: RF - Adj Random Info')
for ax in axs.flat:
    ax.set(xlabel='# of Clusters', ylabel='Score')
    ax.legend(loc='best')
    ax.minorticks_on()
    ax.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
    ax.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
fig.tight_layout()
Observation: For the wine dataset the highest ARI score, 0.072, was obtained with PCA at a cluster size of 5, whereas for the census dataset the highest ARI score, 0.155, was obtained with FastICA at 4 clusters.
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15,15));
axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_PCA_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Wine PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_PCA_EM']["vmeasure"], linestyle='-', marker='o', label = "Wine PCA EM")
axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_PCA_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Census PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_PCA_EM']["vmeasure"], linestyle='-', marker='o', label = "Census PCA EM")
axs[0][0].set_title('Reduced Dimensionality: PCA - V Measure')
axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_ICA_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Wine ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_ICA_EM']["vmeasure"], linestyle='-', marker='o', label = "Wine ICA EM")
axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_ICA_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Census ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_ICA_EM']["vmeasure"], linestyle='-', marker='o', label = "Census ICA EM")
axs[0][1].set_title('Reduced Dimensionality: ICA - V Measure')
axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RP_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Wine RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RP_EM']["vmeasure"], linestyle='-', marker='o', label = "Wine RP EM")
axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RP_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Census RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RP_EM']["vmeasure"], linestyle='-', marker='o', label = "Census RP EM")
axs[1][0].set_title('Reduced Dimensionality: RP - V Measure')
axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RF_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Wine RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RF_EM']["vmeasure"], linestyle='-', marker='o', label = "Wine RF EM")
axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RF_Kmeans']["vmeasure"], linestyle='-', marker='o', label = "Census RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RF_EM']["vmeasure"], linestyle='-', marker='o', label = "Census RF EM")
axs[1][1].set_title('Reduced Dimensionality: RF - V Measure')
for ax in axs.flat:
    ax.set(xlabel='# of Clusters', ylabel='Score')
    ax.legend(loc='best')
    ax.minorticks_on()
    ax.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
    ax.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
fig.tight_layout()
Observation: Across all dimensionality reduction techniques, V-measure scores for the wine dataset were lower than those for the Census dataset. The highest V-measure score for the wine dataset came from the FastICA model with 5 clusters. For the Census dataset, the Gaussian mixture model applied to the Random-Forest-reduced data gave the highest V-measure score of 0.16 with 5 clusters.
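The external validation metrics compared above (adjusted Rand index and V-measure) score cluster assignments against the true class labels and are invariant to how the cluster ids are numbered. A minimal sketch with toy labels (the values here are illustrative, not from this report):

```python
from sklearn.metrics import adjusted_rand_score, v_measure_score

# True class labels and two candidate clusterings (toy data)
y_true = [0, 0, 0, 1, 1, 1]
labels_a = [1, 1, 1, 0, 0, 0]  # same partition as y_true, different label ids
labels_b = [0, 1, 0, 1, 0, 1]  # unrelated partition

# A perfect partition scores 1.0 regardless of label permutation
print(adjusted_rand_score(y_true, labels_a))  # 1.0
print(v_measure_score(y_true, labels_a))      # 1.0

# Unrelated assignments score near 0 (ARI can even be negative)
print(adjusted_rand_score(y_true, labels_b))
```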
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15,15));
axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_PCA_Kmeans']["exectime"], linestyle='-', marker='o', label = "Wine PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_PCA_EM']["exectime"], linestyle='-', marker='o', label = "Wine PCA EM", alpha=0.5)
axs[0][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_PCA_Kmeans']["exectime"], linestyle='-', marker='o', label = "Census PCA KM")
axs[0][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_PCA_EM']["exectime"], linestyle='-', marker='o', label = "Census PCA EM")
axs[0][0].set_title('Reduced Dimensionality: PCA - Execution Time')
axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_ICA_Kmeans']["exectime"], linestyle='-', marker='o', label = "Wine ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_ICA_EM']["exectime"], linestyle='-', marker='o', label = "Wine ICA EM", alpha=0.5)
axs[0][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_ICA_Kmeans']["exectime"], linestyle='-', marker='o', label = "Census ICA KM")
axs[0][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_ICA_EM']["exectime"], linestyle='-', marker='o', label = "Census ICA EM")
axs[0][1].set_title('Reduced Dimensionality: ICA - Execution Time')
axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RP_Kmeans']["exectime"], linestyle='-', marker='o', label = "Wine RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RP_EM']["exectime"], linestyle='-', marker='o', label = "Wine RP EM", alpha=0.5)
axs[1][0].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RP_Kmeans']["exectime"], linestyle='-', marker='o', label = "Census RP KM")
axs[1][0].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RP_EM']["exectime"], linestyle='-', marker='o', label = "Census RP EM")
axs[1][0].set_title('Reduced Dimensionality: RP - Execution Time')
axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Wine_RF_Kmeans']["exectime"], linestyle='-', marker='o', label = "Wine RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Wine_RF_EM']["exectime"], linestyle='-', marker='o', label = "Wine RF EM", alpha=0.5)
axs[1][1].plot(kclusters, dfstatsKMDimRed.loc[dfstatsKMDimRed['model_id'] == 'Census_RF_Kmeans']["exectime"], linestyle='-', marker='o', label = "Census RF KM")
axs[1][1].plot(kclusters, dfstatsEMDimRed.loc[dfstatsEMDimRed['model_id'] == 'Census_RF_EM']["exectime"], linestyle='-', marker='o', label = "Census RF EM")
axs[1][1].set_title('Reduced Dimensionality: RF - Execution Time')
for ax in axs.flat:
    ax.set(xlabel='# of Clusters', ylabel='Time')
    ax.legend(loc='best')
    ax.minorticks_on()
    ax.grid(b=True, which='major', color='k', linestyle='-', alpha=0.1)
    ax.grid(b=True, which='minor', color='r', linestyle='-', alpha=0.05)
fig.tight_layout()
Observation: For all four dimensionality reduction techniques, the overall execution time for the wine dataset was close to zero, whereas the execution time for the Census dataset increased exponentially with the number of clusters. This may be due to the size of the Census dataset. For the Census dataset, EM was computationally more efficient than the K-means algorithm.
In this section the dimensionality reduction algorithms are applied to the Census dataset and the supervised neural network classifier was re-run on the newly projected data.
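The reduced feature matrices used below (e.g. Census_X_train_PCA) are assumed to come from fitting each reducer on the training split only and then transforming both splits with the same fitted projection. A minimal sketch with PCA on stand-in data (the component count and array shapes are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 11))  # stand-in for Census_X_train
X_test = rng.normal(size=(25, 11))    # stand-in for Census_X_test

pca = PCA(n_components=5, random_state=1)  # component count is illustrative
X_train_pca = pca.fit_transform(X_train)   # fit the projection on training data only
X_test_pca = pca.transform(X_test)         # reuse the same projection on the test split

print(X_train_pca.shape, X_test_pca.shape)  # (100, 5) (25, 5)
```

Fitting on the training split alone avoids leaking test-set statistics into the learned projection.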
size = 10
cv = StratifiedKFold(size, shuffle=True, random_state=1)
Census_mlpclf = MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
t_start = time.time()*1000.0
Census_mlpclf = Census_mlpclf.fit(Census_X_train, Census_y_train)
t_stop = time.time()*1000.0
traintime = t_stop - t_start
t_start = time.time()*1000.0
Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test)[:,1] >= 0.4).astype(bool)
t_stop = time.time()*1000.0
testtime = t_stop - t_start
cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='Original - Census')
modelid = "Original - Census"
prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
f1score = f1_score(Census_y_test, Census_y_pred, average='binary')
model_stats.append([modelid, prscore, rcscore, f1score, traintime, testtime])
print("Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
utils.plot_pr_curve(Census_mlpclf, Census_X_test, Census_y_test, "Original - Census - PR Curve", False)
Census_mlpclf = MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
t_start = time.time()*1000.0
Census_mlpclf = Census_mlpclf.fit(Census_X_train_PCA, Census_y_train)
t_stop = time.time()*1000.0
traintime = t_stop - t_start
t_start = time.time()*1000.0
Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_PCA)[:,1] >= 0.4).astype(bool)
t_stop = time.time()*1000.0
testtime = t_stop - t_start
cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='PCA - Census')
modelid = "PCA - Census"
prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
f1score = f1_score(Census_y_test, Census_y_pred, average='binary')
model_stats.append([modelid, prscore, rcscore, f1score, traintime, testtime])
print("Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
utils.plot_pr_curve(Census_mlpclf, Census_X_test_PCA, Census_y_test, "PCA - Census - PR Curve", False)
Census_mlpclf = MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
t_start = time.time()*1000.0
Census_mlpclf = Census_mlpclf.fit(Census_X_train_ICA, Census_y_train)
t_stop = time.time()*1000.0
traintime = t_stop - t_start
t_start = time.time()*1000.0
Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_ICA)[:,1] >= 0.4).astype(bool)
t_stop = time.time()*1000.0
testtime = t_stop - t_start
cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='ICA - Census')
modelid = "ICA - Census"
prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
f1score = f1_score(Census_y_test, Census_y_pred, average='binary')
model_stats.append([modelid, prscore, rcscore, f1score, traintime, testtime])
print("Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
utils.plot_pr_curve(Census_mlpclf, Census_X_test_ICA, Census_y_test, "ICA - Census - PR Curve", False)
Census_mlpclf = MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
t_start = time.time()*1000.0
Census_mlpclf = Census_mlpclf.fit(Census_X_train_RP, Census_y_train)
t_stop = time.time()*1000.0
traintime = t_stop - t_start
t_start = time.time()*1000.0
Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_RP)[:,1] >= 0.4).astype(bool)
t_stop = time.time()*1000.0
testtime = t_stop - t_start
cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='RP - Census')
modelid = "RP - Census"
prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
f1score = f1_score(Census_y_test, Census_y_pred, average='binary')
model_stats.append([modelid, prscore, rcscore, f1score, traintime, testtime])
print("Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
utils.plot_pr_curve(Census_mlpclf, Census_X_test_RP, Census_y_test, "RP - Census - PR Curve", False)
Census_mlpclf = MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
t_start = time.time()*1000.0
Census_mlpclf = Census_mlpclf.fit(Census_X_train_RF, Census_y_train)
t_stop = time.time()*1000.0
traintime = t_stop - t_start
t_start = time.time()*1000.0
Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_RF)[:,1] >= 0.4).astype(bool)
t_stop = time.time()*1000.0
testtime = t_stop - t_start
cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='RF - Census')
modelid = "RF - Census"
prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
f1score = f1_score(Census_y_test, Census_y_pred, average='binary')
model_stats.append([modelid, prscore, rcscore, f1score, traintime, testtime])
print("Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
utils.plot_pr_curve(Census_mlpclf, Census_X_test_RF, Census_y_test, "RF - Census - PR Curve", False)
Observation: Analysis of the four dimensionality reduction algorithms reveals that the neural network trained on the datasets reduced with PCA and the Random Forest selector slightly outperforms the neural network trained on the original dataset. The performance of these two dimensionality reduction techniques is very close. Performance on the dataset reduced using ICA was the worst, and much lower than performance on the original dataset.
In this section the neural network was trained on the clustered dataset. The clustered Census dataset represents each cluster as a feature, which reduces the number of dimensions. Both clustering algorithms are first run to identify the optimal value of K; the neural network is then tuned to optimize performance.
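KMeans.transform, used below to build the clustered dataset, maps each sample to its distance from each of the k centroids, so the transformed data has exactly one feature per cluster. A minimal sketch on toy data (the cluster count and shapes are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 11))  # toy stand-in for the 11-feature data

k = 4  # illustrative cluster count
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# transform() returns the distance to each centroid: one new feature per cluster
X_clustered = km.transform(X)
print(X_clustered.shape)  # (60, 4)
```

The Gaussian mixture variant below plays the analogous role with predict_proba, yielding one posterior-probability feature per component.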
model_stats = []
kclusters = range(2, 15)
for k in kclusters:
    model = KMeans(n_clusters=k)
    model.fit(Census_X_train)
    Census_X_train_kmeans = model.transform(Census_X_train)
    Census_X_test_kmeans = model.transform(Census_X_test)
    Census_mlpclf = MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
    t_start = time.time()*1000.0
    Census_mlpclf = Census_mlpclf.fit(Census_X_train_kmeans, Census_y_train)
    t_stop = time.time()*1000.0
    traintime = t_stop - t_start
    t_start = time.time()*1000.0
    Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_kmeans)[:,1] >= 0.4).astype(bool)
    t_stop = time.time()*1000.0
    testtime = t_stop - t_start
    cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
    np.set_printoptions(precision=2)
    modelid = "Kmeans - Census"
    prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
    rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
    f1score = f1_score(Census_y_test, Census_y_pred, average='binary')
    model_stats.append([k, modelid, prscore, rcscore, f1score, traintime, testtime])
    print("K: "+str(k)+", Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
model_stats = []
kclusters = range(2, 15)
for k in kclusters:
    model = GaussianMixture(n_components=k, covariance_type='diag')
    model.fit(Census_X_train)
    Census_X_train_EM = model.predict_proba(Census_X_train)
    Census_X_test_EM = model.predict_proba(Census_X_test)
    Census_mlpclf = MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
    t_start = time.time()*1000.0
    Census_mlpclf = Census_mlpclf.fit(Census_X_train_EM, Census_y_train)
    t_stop = time.time()*1000.0
    traintime = t_stop - t_start
    t_start = time.time()*1000.0
    Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_EM)[:,1] >= 0.4).astype(bool)
    t_stop = time.time()*1000.0
    testtime = t_stop - t_start
    cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
    np.set_printoptions(precision=2)
    modelid = "EM - Census"
    prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
    rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
    f1score = f1_score(Census_y_test, Census_y_pred, average='binary')
    model_stats.append([k, modelid, prscore, rcscore, f1score, traintime, testtime])
    print("K: "+str(k)+", Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
Observations: Both the K-means and GMM algorithms were executed multiple times to measure precision, recall, F1 score, train time, and test time. The optimal number of clusters was chosen based on the highest F1 score: 8 clusters for K-means and 7 components for the Gaussian Mixture Model. For K-means, several values of K tied for the highest F1 score; a cluster count of 8 was chosen because it had the lowest train and test times.
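The selection rule described above (highest F1 score, ties broken by lowest combined train and test time) can be sketched over rows shaped like the model_stats entries ([k, model_id, precision, recall, f1, traintime, testtime]); the numbers below are illustrative, not measured values:

```python
# Rows: [k, model_id, precision, recall, f1, traintime, testtime] (toy values)
stats = [
    [6, "Kmeans - Census", 0.61, 0.55, 0.58, 900.0, 12.0],
    [7, "Kmeans - Census", 0.63, 0.57, 0.60, 950.0, 13.0],
    [8, "Kmeans - Census", 0.63, 0.57, 0.60, 700.0, 9.0],  # ties on F1, but faster
]

# Sort key: highest F1 first; among F1 ties, prefer the lowest total time
best = max(stats, key=lambda row: (row[4], -(row[5] + row[6])))
print(best[0])  # 8
```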
Some of the neural network parameters tuned during the supervised learning assignment were re-tuned in this step. The optimal activation was identified as logistic and the optimal learning rate as 0.003. Optimization was carried out using GridSearchCV with 10-fold stratified cross-validation.
model = KMeans(n_clusters=8)
model.fit(Census_X_train)
Census_X_train_kmeans = model.transform(Census_X_train)
Census_X_test_kmeans = model.transform(Census_X_test)
Census_mlpclf = MLPClassifier(hidden_layer_sizes=57, learning_rate_init=0.01, activation='tanh', random_state=1)
Census_mlpclf = Census_mlpclf.fit(Census_X_train_kmeans, Census_y_train)
Census_mlpclf = MLPClassifier(hidden_layer_sizes=57, random_state=1)
activation =['identity', 'logistic', 'tanh', 'relu']
learning_rates = np.arange(0.0005, 0.005, 0.0005)  # start above 0: learning_rate_init must be positive
param_grid = {'learning_rate_init': learning_rates, 'activation': activation}
Census_mlpgs= GridSearchCV(Census_mlpclf,param_grid, cv=cv, scoring='f1', n_jobs=-1)
Census_mlpgs.fit(Census_X_train_kmeans,Census_y_train)
print("Best Score:" + str(Census_mlpgs.best_score_))
print("Best Parameters: " + str(Census_mlpgs.best_params_))
print("Best Estimator: " + str(Census_mlpgs.best_estimator_))
Census_mlpclf = MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=57, learning_rate='constant',
learning_rate_init=0.004, max_fun=15000, max_iter=200,
momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
power_t=0.5, random_state=1, shuffle=True, solver='adam',
tol=0.0001, validation_fraction=0.1, verbose=False,
warm_start=False)
#plot learning curve
utils.plot_learning_curve(Census_mlpclf, "MLP - Census - Learning Curve", Census_X_train_kmeans, Census_y_train, cv=cv, n_jobs=-1, exp_score=0.7, train_sizes=np.linspace(0.05, 1.0, 10), ylim=[0.5,1])
t_start = time.time()*1000.0
Census_mlpclf = Census_mlpclf.fit(Census_X_train_kmeans, Census_y_train)
t_stop = time.time()*1000.0
traintime = t_stop - t_start
t_start = time.time()*1000.0
Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_kmeans)[:,1] >= 0.4).astype(bool)
t_stop = time.time()*1000.0
testtime = t_stop - t_start
cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='Kmeans - Census')
modelid = "Kmeans - Census"
prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
f1score = f1_score(Census_y_test, Census_y_pred, average='binary')
model_stats.append([8, modelid, prscore, rcscore, f1score, traintime, testtime])
print("K: 8, Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
utils.plot_pr_curve(Census_mlpclf, Census_X_test_kmeans, Census_y_test, "Kmeans - Census - PR Curve", False)
Some of the neural network parameters tuned during the supervised learning assignment were re-tuned in this step. The optimal activation was identified as logistic and the optimal learning rate as 0.003. Optimization was carried out using GridSearchCV with 10-fold stratified cross-validation.
model = GaussianMixture(n_components=7, covariance_type='diag')
model.fit(Census_X_train)
Census_X_train_EM = model.predict_proba(Census_X_train)
Census_X_test_EM = model.predict_proba(Census_X_test)
Census_mlpclf = MLPClassifier(hidden_layer_sizes=57, random_state=1)
activation =['identity', 'logistic', 'tanh', 'relu']
learning_rates = np.arange(0.0005, 0.005, 0.0005)  # start above 0: learning_rate_init must be positive
param_grid = {'learning_rate_init': learning_rates, 'activation': activation}
Census_mlpgs= GridSearchCV(Census_mlpclf,param_grid, cv=cv, scoring='f1', n_jobs=-1)
Census_mlpgs.fit(Census_X_train_EM,Census_y_train)
print("Best Score:" + str(Census_mlpgs.best_score_))
print("Best Parameters: " + str(Census_mlpgs.best_params_))
print("Best Estimator: " + str(Census_mlpgs.best_estimator_))
Census_mlpclf = MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=57, learning_rate='constant',
learning_rate_init=0.003, max_fun=15000, max_iter=200,
momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
power_t=0.5, random_state=1, shuffle=True, solver='adam',
tol=0.0001, validation_fraction=0.1, verbose=False,
warm_start=False)
#plot learning curve
utils.plot_learning_curve(Census_mlpclf, "MLP - EM - Census - Learning Curve", Census_X_train_EM, Census_y_train, cv=cv, n_jobs=-1, exp_score=0.7, train_sizes=np.linspace(0.05, 1.0, 10), ylim=[0.1,1])
t_start = time.time()*1000.0
Census_mlpclf = Census_mlpclf.fit(Census_X_train_EM, Census_y_train)
t_stop = time.time()*1000.0
traintime = t_stop - t_start
t_start = time.time()*1000.0
Census_y_pred = (Census_mlpclf.predict_proba(Census_X_test_EM)[:,1] >= 0.4).astype(bool)
t_stop = time.time()*1000.0
testtime = t_stop - t_start
cnf_matrix = confusion_matrix(Census_y_test, Census_y_pred)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
utils.plot_confusion_matrix(cnf_matrix, classes=[0,1], title='EM - Census')
modelid = "EM - Census"
prscore = precision_score(Census_y_test, Census_y_pred, average='binary')
rcscore = recall_score(Census_y_test, Census_y_pred, average='binary')
f1score = f1_score(Census_y_test, Census_y_pred, average='binary')
model_stats.append([7, modelid, prscore, rcscore, f1score, traintime, testtime])
print("K: 7, Precision: "+str("{:.2f}".format(prscore))+", Recall: "+str("{:.2f}".format(rcscore))+", F1 Score: "+str("{:.2f}".format(f1score))+", Train Time: "+str("{:.2f}".format(traintime))+", Test Time: "+str("{:.2f}".format(testtime)))
utils.plot_pr_curve(Census_mlpclf, Census_X_test_EM, Census_y_test, "EM - Census - PR Curve", False)
Observation: The neural network trained on the dataset clustered using the Gaussian Mixture Model performed very poorly compared to the neural network trained on the original dataset. Train and test times on the clustered dataset were much lower than on the original dataset; this reduction in execution time is due to the reduced number of features.
Looking at the performance of the clustering and dimensionality reduction techniques, it can be concluded that, for the selected datasets, dimensionality reduction performed better than the clustering methods. However, compared to the original dataset, performance on the dimensionally reduced data was not substantially better. One reason for this could be the structure of the datasets: both had fewer than 15 attributes. Although the performance improvement from PCA and Random Forest was not substantial, there was a substantial reduction in training and testing time. Hence, dimensionality reduction techniques should be considered not only for performance improvement but also for reducing execution time.